Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [525]:
# Installing the libraries with the specified version.
!pip3 install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

In [526]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Importing scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier  # Example for classifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

Loading the dataset¶

In [527]:
data = pd.read_csv('Loan_Modelling.csv');
data.head()
Out[527]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

Data Overview¶

  • Observations
  • Sanity checks
In [528]:
df = data.copy()
print(f"shape of data: {data.shape}")
print(f"{data.info()}")
print(f"data description:{data.describe().T}" )
if(data.isnull().sum().sum() == 0):
    print(f"There are no null values in the provide data" )
else:
    print(f"There are {data.isnull().sum().sum()} null value in data")
shape of data: (5000, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
None
data description:                     count          mean          std      min       25%  \
ID                  5000.0   2500.500000  1443.520003      1.0   1250.75   
Age                 5000.0     45.338400    11.463166     23.0     35.00   
Experience          5000.0     20.104600    11.467954     -3.0     10.00   
Income              5000.0     73.774200    46.033729      8.0     39.00   
ZIPCode             5000.0  93169.257000  1759.455086  90005.0  91911.00   
Family              5000.0      2.396400     1.147663      1.0      1.00   
CCAvg               5000.0      1.937938     1.747659      0.0      0.70   
Education           5000.0      1.881000     0.839869      1.0      1.00   
Mortgage            5000.0     56.498800   101.713802      0.0      0.00   
Personal_Loan       5000.0      0.096000     0.294621      0.0      0.00   
Securities_Account  5000.0      0.104400     0.305809      0.0      0.00   
CD_Account          5000.0      0.060400     0.238250      0.0      0.00   
Online              5000.0      0.596800     0.490589      0.0      0.00   
CreditCard          5000.0      0.294000     0.455637      0.0      0.00   

                        50%       75%      max  
ID                   2500.5   3750.25   5000.0  
Age                    45.0     55.00     67.0  
Experience             20.0     30.00     43.0  
Income                 64.0     98.00    224.0  
ZIPCode             93437.0  94608.00  96651.0  
Family                  2.0      3.00      4.0  
CCAvg                   1.5      2.50     10.0  
Education               2.0      3.00      3.0  
Mortgage                0.0    101.00    635.0  
Personal_Loan           0.0      0.00      1.0  
Securities_Account      0.0      0.00      1.0  
CD_Account              0.0      0.00      1.0  
Online                  1.0      1.00      1.0  
CreditCard              0.0      1.00      1.0  
There are no null values in the provide data

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [529]:
#1
print(f"*** What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution? ***")
mortgage_dist = df['Mortgage']
plt.figure(figsize=(10,8))
sns.histplot(data=df, x=mortgage_dist)
plt.figure(figsize=(10,8))
sns.boxplot(data=df, x=mortgage_dist)
mortgage_dist_atzero = mortgage_dist[mortgage_dist == 0]
print(f"a. There are {mortgage_dist_atzero.shape[0]} customers that have no mortgage or the mortgage at 0 leading most of the data to be present at the first quartile")
mortgage_dist_outliers = mortgage_dist[mortgage_dist > 250]
print(f"b. There are about {mortgage_dist_outliers.shape[0]} outliers in mortgage out of {mortgage_dist.shape[0]} rows leading to approximately {(mortgage_dist_outliers.shape[0]/mortgage_dist.shape[0]) * 100:.2f}% of outliers in overall data")
print(f"c. Mortgage is right skewed")
*** What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution? ***
a. There are 3462 customers that have no mortgage or the mortgage at 0 leading most of the data to be present at the first quartile
b. There are about 299 outliers in mortgage out of 5000 rows leading to approximately 5.98% of outliers in overall data
c. Mortgage is right skewed
No description has been provided for this image
No description has been provided for this image
In [530]:
#2
customers = df['ID'].unique().shape[0]
customers
print(f"*** How many customers have credit cards? ***")
print(f"a. All the {df.shape[0]} customers are unique")
customers_with_cc = df[df['CreditCard'] == 1].shape[0]
print(f"b. There are {customers_with_cc} customer with credit card")
*** How many customers have credit cards? ***
a. All the 5000 customers are unique
b. There are 1470 customer with credit card
In [531]:
#3
print(f"***What are the attributes that have a strong correlation with the target attribute (personal loan)?***")
plt.figure(figsize=(10,8))
sns.heatmap(data=df.corr(),cbar=False,fmt='.2f',cmap='Spectral',vmin=-1,vmax=1,annot=True)
print(f"a. Personal_Loan have strong positive co-relation with Income, CCAvg and CD_Account at {df['Personal_Loan'].corr(df['Income']):.2f},{df['Personal_Loan'].corr(df['CCAvg']):.2f} and {df['Personal_Loan'].corr(df['CD_Account']):.2f} respectively")
***What are the attributes that have a strong correlation with the target attribute (personal loan)?***
a. Personal_Loan have strong positive co-relation with Income, CCAvg and CD_Account at 0.50,0.37 and 0.32 respectively
No description has been provided for this image
In [532]:
#4
print(f"***How does a customer's interest in purchasing a loan vary with their age?***")
print(f"a. From the heatmap we could already say that age and personal loan doesn't exhibit a great corelation")
plt.figure(figsize=(10,8))
sns.scatterplot(x=df['Personal_Loan'],y=df['Age'])
print("b. Customers at all ages have interest in buying personal loans")
***How does a customer's interest in purchasing a loan vary with their age?***
a. From the heatmap we could already say that age and personal loan doesn't exhibit a great corelation
b. Customers at all ages have interest in buying personal loans
No description has been provided for this image
In [533]:
#5
print(f"***How does a customer's interest in purchasing a loan vary with their education?***")
print(f"a. Education vs Personal loan")
for i in range(1,4):
    print(f"{i}. Education with value {i}: {(df[(df['Education'] == i) & (df['Personal_Loan'] == 1)]).shape[0]} of {df[df['Education'] == i].shape[0]} have accepted personal loan")
print(f"b. Customers who have Advanced/Professional education are higher probable of accepting the personal loan")
***How does a customer's interest in purchasing a loan vary with their education?***
a. Education vs Personal loan
1. Education with value 1: 93 of 2096 have accepted personal loan
2. Education with value 2: 182 of 1403 have accepted personal loan
3. Education with value 3: 205 of 1501 have accepted personal loan
b. Customers who have Advanced/Professional education are higher probable of accepting the personal loan
In [534]:
#6
print(f"***How does a customer's interest in purchasing a loan vary with their Income?***")
pl_vs_income = df[df['Personal_Loan'] == 1]
print(f"a. The lowest income where the customer accepted the loan is {pl_vs_income['Income'].min()}k which indicates that customers with income less than 60k have no interest in accepting personal loans")
print(f"b. Average income of the customer who accepted personal loan is {pl_vs_income['Income'].describe().T['mean']:.2f}k")
# Scatter Plot using Seaborn
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x='Income', y='Personal_Loan', hue='Personal_Loan', palette={1: 'blue', 0: 'red'})

# Labels and Title
plt.xlabel('Income')
plt.ylabel('Loan Status')
plt.title('Income vs Loan Status')
plt.show()
***How does a customer's interest in purchasing a loan vary with their Income?***
a. The lowest income where the customer accepted the loan is 60k which indicates that customers with income less than 60k have no interest in accepting personal loans
b. Average income of the customer who accepted personal loan is 144.75k
No description has been provided for this image
In [535]:
#7
print(f"***How does a customer's interest in purchasing a loan vary with their CCAvg?***")
pl_vs_ccavg = df[df['Personal_Loan'] == 1]
print(f"b. Customers who accepted personal loan have an average credit card amount of {pl_vs_ccavg['CCAvg'].describe().T['mean']:.2f}k")
# Scatter Plot using Seaborn
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x='CCAvg', y='Personal_Loan', hue='Personal_Loan', palette={1: 'blue', 0: 'red'})

# Labels and Title
plt.xlabel('CCAvg')
plt.ylabel('Loan Status')
plt.title('CCAvg vs Loan Status')
plt.show()
***How does a customer's interest in purchasing a loan vary with their CCAvg?***
b. Customers who accepted personal loan have an average credit card amount of 3.91k
No description has been provided for this image
In [536]:
#8 pairplot to identify relations better
plt.figure(figsize=(12,8))
sns.pairplot(df, hue='Personal_Loan',diag_kind='kde')
print(f"Income, Age, Mortatage and CCAvg clearly shows a strong corelation with personal loan acceptance")
Income, Age, Mortatage and CCAvg clearly shows a strong corelation with personal loan acceptance
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [537]:
#Missing value treatment
print(f"There are {df.isnull().sum().sum()} null values in the provided data")
print(f"{df.nunique()}")
#dropping ID column as there there are 5000 unique values with in 5000 rows
df = df.drop(columns=['ID'])
There are 0 null values in the provided data
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64
In [538]:
#Feature engineering
#checking if zipcode or area customers live in affecting the acceptance of loan due to other parameters outside the provided data
print(f"Unique zipcode values: {df['ZIPCode'].nunique()}")
#corelation between zipcode and personal loan
print(f"correaltion between personal loan and ZIPCode{df['Personal_Loan'].corr(df['ZIPCode']):.2f}")
#Dropping the column zipcode as the corelation is too low for this feature to be considered
df = df.drop(columns=['ZIPCode'])
Unique zipcode values: 467
correaltion between personal loan and ZIPCode-0.00
In [539]:
#Column Expereince have high corelation with Age which directly related to the redundancy, dropping the column experience as well
df = df.drop(columns=['Experience'])
In [540]:
#Data prepeation for model

# Define features (X) and target (y)
X = df.drop(columns=['Personal_Loan'])  # Drop the target column
y = df['Personal_Loan']                 # Target column

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Print the shapes of the resulting splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
print('Percentage of classes in train set')
print(100*y_train.value_counts(normalize=True), '\n')
print('Percentage of classes in test set')
print(100*y_test.value_counts(normalize=True), '\n')
print(f"Balance of classes are same in train and test sets")
X_train shape: (4000, 10)
X_test shape: (1000, 10)
y_train shape: (4000,)
y_test shape: (1000,)
Percentage of classes in train set
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64 

Percentage of classes in test set
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64 

Balance of classes are same in train and test sets

Model Evaluation Criterion¶

Model Building¶

Reusable functions

Functions for model performance, confusion matrix, plotting decision tree, plotting text tree

In [541]:
#function for model performance calculation
#input parameter:(model, X_train/X_test, y_prediction_train/y_prediction_test)
def model_performance_classification(model, predictors, target):
    """
    Function to calculate all the model performance metrics

    model: classifier
    predictors: independent variables (X)
    target: dependent variable (y)

    """

    #prediction using predictors
    pred = model.predict(predictors)

    model_accuracy = accuracy_score(target, pred)
    model_recall = recall_score(target, pred)
    model_precision = precision_score(target,pred)
    model_f1_score = f1_score(target,pred)

    #creating dataframe of all the metrics
    df_model_preformance = pd.DataFrame({
        'Accuracy': model_accuracy,
        'Recall': model_recall,
        'Precision': model_precision,
        'F1': model_f1_score
    },
    index=[0])

    return df_model_preformance
In [542]:
#function for enabling confusion matrix
#input parameter:(model, X_train/X_test, y_prediction_train/y_prediction_test)

def plot_confusion_matrix(model, predictors, target):
    """
    To calculate the confusion matrix with percentages

    model: classifier
    predictors: independent variables (X)
    target: dependent variable (y)
    """

    #predicting the target values
    y_pred = model.predict(predictors)

    #creating the confusion matrix
    cm = confusion_matrix(target, y_pred)

    #creating labels
    labels = np.asarray([
        ["{0:0.0f}".format(item) + '\n{0:.2%}'.format(item/cm.flatten().sum())]
        for item in cm.flatten()
    ]).reshape(2,2) #to matrix

    #figure size for confusion matrix
    plt.figure(figsize=(10,8))
    sns.heatmap(cm,annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted lable')
In [565]:
#function to plot the decision tree for given model
#input parameter:(model)
from sklearn import tree

def plot_decision_tree(model):
    feature_names = list(X_train.columns)
#setting figure size
    plt.figure(figsize=(20,20))
#plotting the decision tree
    out = tree.plot_tree(model, feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)

    for o in out:
        arrow = o.arrow_patch
        if arrow is not None:
            arrow.set_edgecolor('black')
            arrow.set_linewidth(1)

    print(f"Tree Depth: {model.get_depth()}")
    print(f"Number of Leaves: {model.get_n_leaves()}")
plt.show()
In [544]:
# function for text report on the model
#input parameter:(model)
def plot_text_tree(model):
    print(
            tree.export_text(
                model,
                feature_names=list(X_train.columns),
                show_weights=True
            )
        )

Decision Tree from sklearn model

dtree_model_one | decision tree plot | tree text plot | confusion matrix | performance matrix

In [545]:
#initializing the decision tree and fitting the training data
dtree_model_one = DecisionTreeClassifier(random_state=42)
dtree_model_one.fit(X_train, y_train)
Out[545]:
DecisionTreeClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=42)
In [546]:
#Plotting the decision tree for model one
plot_decision_tree(dtree_model_one)
Tree Depth: 13
Number of Leaves: 66
No description has been provided for this image
In [547]:
#Plotting the decision tree text for model one
plot_text_tree(dtree_model_one)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 27.00
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Age >  27.00
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- Income <= 63.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Income >  63.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |   |   |   |--- weights: [29.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- Mortgage <= 94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Mortgage >  94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  216.50
|   |   |   |   |   |   |   |--- Mortgage <= 249.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  249.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |--- weights: [47.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |--- Mortgage <= 99.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  99.50
|   |   |   |   |   |   |   |   |--- weights: [19.00, 0.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |--- Age <= 31.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  31.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- Mortgage <= 38.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage >  38.00
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Family >  1.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  33.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [506.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- CCAvg <= 1.45
|   |   |   |   |   |   |--- Age <= 40.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  40.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  1.45
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 1.70
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.55
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  1.55
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  1.70
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Age >  57.50
|   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |--- Age <= 40.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  40.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- Mortgage <= 265.50
|   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  265.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 244.00] class: 1

In [548]:
# Confusion matrix calculation for model one training data
plot_confusion_matrix(dtree_model_one, X_train, y_train)
No description has been provided for this image
In [549]:
#Perfomance calculations for model one
dtree_model_one_train_perfromance = model_performance_classification(dtree_model_one, X_train, y_train)
dtree_model_one_train_perfromance
Out[549]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [550]:
#plotting confusion matrix for test data
plot_confusion_matrix(dtree_model_one, X_test, y_test)
No description has been provided for this image
In [551]:
#performance metrics for model one using test data
dtree_model_one_test_perfromance = model_performance_classification(dtree_model_one,X_test,y_test)
dtree_model_one_test_perfromance
Out[551]:
Accuracy Recall Precision F1
0 0.983 0.947917 0.883495 0.914573

Model Performance Improvement¶

Tree depth analysis for pre pruning

In [552]:
#Tree depth analysis
depths = range(1, 14) #As the tree depth is 13 from the above statement
train_scores = []
test_scores = []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, tree.predict(X_train)))
    test_scores.append(accuracy_score(y_test, tree.predict(X_test)))

# Plot train vs. test accuracy
plt.figure(figsize=(10, 6))
plt.plot(depths, train_scores, label='Train Accuracy', marker='o')
plt.plot(depths, test_scores, label='Test Accuracy', marker='o')
plt.xlabel('Tree Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Train vs Test Accuracy by Tree Depth')
plt.show()
No description has been provided for this image
In [553]:
#Analysis for max_depth, max_leaf_nodes and min_samples
max_depth_values = np.arange(2,13,2) #Excluding root node, and excluding final node
max_leaf_nodes_values = np.arange(10,51,10) #Generalized value
min_sample_split_values = np.arange(10,66,10) #

best_estimator = None
best_score_diff = float('inf')

for max_depth in max_depth_values:
    for max_leaf in max_leaf_nodes_values:
        for min_sample in min_sample_split_values:
            estimator = DecisionTreeClassifier(
                max_depth= max_depth,
                max_leaf_nodes= max_leaf,
                min_samples_split=min_sample,
                random_state=42
            )

            estimator.fit(X_train,y_train)

            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            train_f1_score = f1_score(y_train, y_train_pred)
            test_f1_score = f1_score(y_test, y_test_pred)
            
            score_diff = abs(train_f1_score - test_f1_score)

            if score_diff < best_score_diff:
                best_score_diff = score_diff
                best_estimator = estimator

Decision Tree Pre prune

dtree_model_two | decision tree plot | tree text plot | confusion matrix | performance matrix

In [557]:
#Determning the best fit model by iterating max_depth, max)leaf_nodes and min_sample_split for training data set
dtree_model_two = best_estimator
dtree_model_two.fit(X_train,y_train)
Out[557]:
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=40, min_samples_split=20,
                       random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=40, min_samples_split=20,
                       random_state=42)
In [566]:
#plotting tree with pre-pruned model
plot_decision_tree(dtree_model_two)
Tree Depth: 10
Number of Leaves: 32
No description has been provided for this image
In [567]:
#plotting tree with pre-pruned model
plot_text_tree(dtree_model_two)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 27.00
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Age >  27.00
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 2.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |   |   |   |--- weights: [29.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.00, 1.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |--- weights: [9.00, 6.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  216.50
|   |   |   |   |   |   |   |--- weights: [1.00, 2.00] class: 1
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |--- weights: [47.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |--- Mortgage <= 99.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  99.50
|   |   |   |   |   |   |   |   |--- weights: [19.00, 0.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- weights: [9.00, 7.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 10.00] class: 1
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- weights: [2.00, 2.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |--- weights: [3.00, 3.00] class: 0
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [506.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- weights: [12.00, 6.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |--- weights: [14.00, 1.00] class: 0
|   |   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [13.00, 5.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Age >  57.50
|   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |--- weights: [2.00, 3.00] class: 1
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |--- CCAvg <= 3.45
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.45
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [8.00, 6.00] class: 0
|   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [2.00, 7.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 244.00] class: 1

In [568]:
#plotting confusion matrix for the newly derived model pre-prune for train data
plot_confusion_matrix(dtree_model_two, X_train, y_train)
No description has been provided for this image
In [569]:
#Performance calculation for the model with pre pruning technique using train data
dtree_model_two_train_performance = model_performance_classification(dtree_model_two, X_train, y_train)
dtree_model_two_train_performance
Out[569]:
Accuracy Recall Precision F1
0 0.98825 0.898438 0.977337 0.936228
In [570]:
#plotting confusion matrix for the newly derived model pre-prune for test data
plot_confusion_matrix(dtree_model_two, X_test, y_test)
No description has been provided for this image
In [571]:
#Performance calculation for second model with pre pruning technique using test data
dtree_model_two_test_performance = model_performance_classification(dtree_model_two, X_test, y_test)
dtree_model_two_test_performance
Out[571]:
Accuracy Recall Precision F1
0 0.989 0.9375 0.947368 0.942408

Decision Tree Post prune

Deriving alphas | impurities | node counts | depth

In [572]:
#Applying post pruning technique to reduce the branches in the nodes
model_post_prune = DecisionTreeClassifier(random_state=42)
#cost complexity calculation
path = model_post_prune.cost_complexity_pruning_path(X_train, y_train)
#extracting the alphas
ccp_alphas = abs(path.ccp_alphas)
#extracting the impurities
impurities = path.impurities
In [573]:
pd.DataFrame(path)
Out[573]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000233 0.000467
2 0.000244 0.000954
3 0.000246 0.001446
4 0.000300 0.002046
5 0.000306 0.002965
6 0.000331 0.003958
7 0.000333 0.004291
8 0.000333 0.004624
9 0.000333 0.004958
10 0.000350 0.006008
11 0.000373 0.007499
12 0.000375 0.007874
13 0.000381 0.008255
14 0.000400 0.008655
15 0.000406 0.009874
16 0.000419 0.012390
17 0.000455 0.012845
18 0.000461 0.015148
19 0.000493 0.016133
20 0.000579 0.020187
21 0.000584 0.020771
22 0.000779 0.021550
23 0.000823 0.022373
24 0.000831 0.023204
25 0.000870 0.024945
26 0.002424 0.027369
27 0.002667 0.030036
28 0.003000 0.033036
29 0.003753 0.036789
30 0.020023 0.056812
31 0.021549 0.078361
32 0.047604 0.173568
In [574]:
#creating plot alphas
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1],impurities[:-1],marker='o',drawstyle='steps-post')
ax.set_xlabel('Effective alpha')
ax.set_ylabel('Total Impurities of leaves')
ax.set_title('Total Impurities vs Effective Alpha for train set')
Out[574]:
Text(0.5, 1.0, 'Total Impurities vs Effective Alpha for train set')
No description has been provided for this image
In [575]:
#Creating the array of DecisionTreeClassifiers using each ccp_alpha value derived from post pruning method
clfs = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    clf.fit(X_train,y_train)
    clfs.append(clf)
In [576]:
#Plotting Alpha vs Node count and Alpha vs Tree depth to identify the growth of tree

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]

fig, ax = plt.subplots(2,1, figsize=(12,8))

ax[0].plot(ccp_alphas,node_counts,marker='o',drawstyle='steps-post')
ax[0].set_xlabel('Alphas')
ax[0].set_ylabel('Node count')
ax[0].set_title('Alpha vs Node Count')

ax[1].plot(ccp_alphas,depth,marker='o',drawstyle='steps-post')
ax[1].set_xlabel('Alphas')
ax[1].set_ylabel('Tree deptht')
ax[1].set_title('Alpha vs Tree depth')

fig.tight_layout()
No description has been provided for this image
In [577]:
#Creating the array for f1 scores for each alpha using the train data set
train_f1_scores = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    f1_train = f1_score(y_train,pred_train)
    train_f1_scores.append(f1_train)
In [578]:
#Creating the array for f1 scores for each alpha using the test data set
test_f1_scores = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    f1_test = f1_score(y_test, pred_test)
    test_f1_scores.append(f1_test)
In [579]:
#Plotting Alpha vs f1 scores for both train and test data
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel('Alpha')
ax.set_ylabel('F1 score')
ax.set_title('F1 score vs Alpha for training and test set')
ax.plot(ccp_alphas, train_f1_scores, marker='o', drawstyle='steps-post')
ax.plot(ccp_alphas, test_f1_scores, marker='o', drawstyle='steps-post')
ax.legend()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Out[579]:
<matplotlib.legend.Legend at 0x32deb01d0>
No description has been provided for this image
In [580]:
#Identifying the best post pruning model from the derived test scores
index_best_model = np.argmax(test_f1_scores)

dtree_model_three | decision tree plot | tree text plot | confusion matrix | performance matrix

In [581]:
#Assigning the best model identified to a model variable for further calculations
dtree_model_three = clfs[index_best_model]
dtree_model_three
Out[581]:
DecisionTreeClassifier(ccp_alpha=0.0008702884311333967, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.0008702884311333967, random_state=42)
In [582]:
#Plotting the decision tree for the new best model post pruned.
plot_decision_tree(dtree_model_three)
Tree Depth: 4
Number of Leaves: 9
No description has been provided for this image
In [583]:
#Plotting the decision tree for the new best model post pruned.
plot_text_tree(dtree_model_three)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- weights: [136.00, 21.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 10.00] class: 1
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [531.00, 5.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- weights: [12.00, 6.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- weights: [75.00, 12.00] class: 0
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- weights: [12.00, 25.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [2.00, 251.00] class: 1

In [584]:
#Plotting the confusion matrix for the post pruned best model using training data set
plot_confusion_matrix(dtree_model_three, X_train, y_train)
No description has been provided for this image
In [585]:
#Plotting the confusion matrix for the post pruned best model using test data set
plot_confusion_matrix(dtree_model_three, X_test, y_test)
No description has been provided for this image
In [586]:
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using train data
dtree_model_three_train_perfromance = model_performance_classification(dtree_model_three, X_train,y_train)
print(dtree_model_three_train_perfromance)
   Accuracy    Recall  Precision        F1
0   0.98475  0.885417   0.952381  0.917679
In [587]:
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using test data
dtree_model_three_test_performance = model_performance_classification(dtree_model_three, X_test,y_test)
print(dtree_model_three_test_performance)
   Accuracy    Recall  Precision        F1
0     0.991  0.958333   0.948454  0.953368

Decision Tree Post pruning using pre prune model

Deriving alphas | impurities | node counts | depth

In [588]:
#post pruning the pre pruned tree to see for better fit of a model
path_pre_post = dtree_model_two.cost_complexity_pruning_path(X_train, y_train)
#extracting the alphas
ccp_alphas_pre_post = abs(path_pre_post.ccp_alphas)
#extracting the impurities
impurities_pre_post = path_pre_post.impurities
In [589]:
#creating a data frame to display ccp_alphas_pre_post and impurities associated with them in the nodes
pd.DataFrame(path_pre_post)
Out[589]:
ccp_alphas impurities
0 0.000000 0.015098
1 0.000037 0.015135
2 0.000141 0.015276
3 0.000214 0.015491
4 0.000246 0.015983
5 0.000343 0.016326
6 0.000364 0.016690
7 0.000371 0.017432
8 0.000372 0.018175
9 0.000429 0.019891
10 0.000485 0.020376
11 0.000584 0.020960
12 0.000585 0.023300
13 0.000822 0.024945
14 0.002424 0.027369
15 0.002667 0.030036
16 0.003000 0.033036
17 0.003753 0.036789
18 0.020023 0.056812
19 0.021549 0.078361
20 0.047604 0.173568
In [590]:
#creating plot alphas
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas_pre_post[:-1],impurities_pre_post[:-1],marker='o',drawstyle='steps-post')
ax.set_xlabel('Effective alpha')
ax.set_ylabel('Total Impurities of leaves')
ax.set_title('Total Impurities vs Effective Alpha for train set')
Out[590]:
Text(0.5, 1.0, 'Total Impurities vs Effective Alpha for train set')
No description has been provided for this image
In [591]:
#Creating an array with alphas that were derived from post pruning using the pre-pruned model
clfs_pre_post = []
for ccp_alpha in ccp_alphas_pre_post:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    clf.fit(X_train,y_train)
    clfs_pre_post.append(clf)
In [592]:
#Plotting Alpha vs Node count and Alpha vs Tree depth to identify the growth of tree

clfs_pre_post = clfs_pre_post[:-1]
ccp_alphas_pre_post = ccp_alphas_pre_post[:-1]


node_counts_pre_post = [clf_pre_post.tree_.node_count for clf_pre_post in clfs_pre_post]
depth_pre_post = [clf_pre_post.tree_.max_depth for clf_pre_post in clfs_pre_post]

fig, ax = plt.subplots(2,1, figsize=(12,8))

ax[0].plot(ccp_alphas_pre_post,node_counts_pre_post,marker='o',drawstyle='steps-post')
ax[0].set_xlabel('Alphas')
ax[0].set_ylabel('Node count')
ax[0].set_title('Alpha vs Node Count')

ax[1].plot(ccp_alphas_pre_post,depth_pre_post,marker='o',drawstyle='steps-post')
ax[1].set_xlabel('Alphas')
ax[1].set_ylabel('Tree deptht')
ax[1].set_title('Alpha vs Tree depth')

fig.tight_layout()
No description has been provided for this image
In [593]:
#Creating the array for f1 scores for each alpha using the train data set
train_f1_scores_pre_post = []
for clf in clfs_pre_post:
    pred_train = clf.predict(X_train)
    f1_train = f1_score(y_train,pred_train)
    train_f1_scores_pre_post.append(f1_train)
In [594]:
#Creating the array for f1 scores for each alpha using the test data set
test_f1_scores_pre_post = []
for clf in clfs_pre_post:
    pred_test = clf.predict(X_test)
    f1_test = f1_score(y_test, pred_test)
    test_f1_scores_pre_post.append(f1_test)
In [595]:
#Plotting Alpha vs f1 scores for both train and test data
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel('Alpha')
ax.set_ylabel('F1 score')
ax.set_title('F1 score vs Alpha for training and test set')
ax.plot(ccp_alphas_pre_post, train_f1_scores_pre_post, marker='o', drawstyle='steps-post')
ax.plot(ccp_alphas_pre_post, test_f1_scores_pre_post, marker='o', drawstyle='steps-post')
ax.legend()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Out[595]:
<matplotlib.legend.Legend at 0x354db9d60>
No description has been provided for this image
In [596]:
#Identifying the best post pruning model from the derived test scores
index_best_model_pre_post = np.argmax(test_f1_scores_pre_post)

dtree_model_four | decision tree plot | tree text plot | confusion matrix | performance matrix

In [597]:
#Assigning the best model identified to a model variable for further calculations
dtree_model_four = clfs[index_best_model_pre_post]
dtree_model_four
Out[597]:
DecisionTreeClassifier(ccp_alpha=0.0003999999999999999, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.0003999999999999999, random_state=42)
In [598]:
#Plotting the decision tree for the new best model post pruned.
plot_decision_tree(dtree_model_four)
Tree Depth: 13
Number of Leaves: 39
No description has been provided for this image
In [599]:
#Plotting the decision tree for the new best model post pruned.
plot_text_tree(dtree_model_four)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2845.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 92.50
|   |   |   |   |--- Age <= 27.00
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Age >  27.00
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- weights: [45.00, 3.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 4.00] class: 1
|   |   |   |   |   |   |--- Mortgage >  216.50
|   |   |   |   |   |   |   |--- weights: [1.00, 2.00] class: 1
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- weights: [66.00, 1.00] class: 0
|   |   |   |--- Income >  92.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- weights: [3.00, 7.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- weights: [3.00, 1.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Family >  1.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- weights: [529.00, 3.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- weights: [2.00, 5.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.95
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |--- weights: [14.00, 1.00] class: 0
|   |   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Age >  57.50
|   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |--- weights: [2.00, 3.00] class: 1
|   |   |   |--- CCAvg >  2.95
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- Mortgage <= 265.50
|   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  265.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [2.00, 251.00] class: 1

In [600]:
#Plotting the confusion matrix for the post pruned best model using training data set
plot_confusion_matrix(dtree_model_four, X_train, y_train)
No description has been provided for this image
In [601]:
#Plotting the confusion matrix for the post pruned best model using test data set
plot_confusion_matrix(dtree_model_four, X_test, y_test)
No description has been provided for this image
In [602]:
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using train data
dtree_model_four_train_perfromance = model_performance_classification(dtree_model_four, X_train,y_train)
print(dtree_model_four_train_perfromance)
   Accuracy    Recall  Precision        F1
0   0.99475  0.976562   0.968992  0.972763
In [603]:
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using test data
dtree_model_four_test_performance = model_performance_classification(dtree_model_four, X_test,y_test)
print(dtree_model_four_test_performance)
   Accuracy    Recall  Precision    F1
0     0.982  0.947917      0.875  0.91

Model Performance Comparison and Final Model Selection¶

In [604]:
#Comparing the model performances for train data which were derived earlier
models_train_comparison_df = pd.concat(
    [
        dtree_model_one_train_perfromance.T,
        dtree_model_two_train_performance.T,
        dtree_model_three_train_perfromance.T,
        dtree_model_four_train_perfromance.T
    ],
    axis=1
)

models_train_comparison_df.columns = [
    "Normal Decision Tree",
    "Prepruned Decision Tree",
    "Post pruned Decision Tree",
    "Pre Post pruned Decision Tree"
]

models_train_comparison_df
Out[604]:
Normal Decision Tree Prepruned Decision Tree Post pruned Decision Tree Pre Post pruned Decision Tree
Accuracy 1.0 0.988250 0.984750 0.994750
Recall 1.0 0.898438 0.885417 0.976562
Precision 1.0 0.977337 0.952381 0.968992
F1 1.0 0.936228 0.917679 0.972763
In [605]:
#Comparing the model performances for test data which were derived earlier
models_test_comparison_df = pd.concat(
    [
        dtree_model_one_test_perfromance.T,
        dtree_model_two_test_performance.T,
        dtree_model_three_test_performance.T,
        dtree_model_four_test_performance.T
    ],
    axis=1
)

models_test_comparison_df.columns = [
    "Normal Decision Tree",
    "Prepruned Decision Tree",
    "Post pruned Decision Tree",
    "Pre Post pruned Decision Tree"
]

models_test_comparison_df
Out[605]:
Normal Decision Tree Prepruned Decision Tree Post pruned Decision Tree Pre Post pruned Decision Tree
Accuracy 0.983000 0.989000 0.991000 0.982000
Recall 0.947917 0.937500 0.958333 0.947917
Precision 0.883495 0.947368 0.948454 0.875000
F1 0.914573 0.942408 0.953368 0.910000
In [606]:
scores_diff = models_test_comparison_df - models_train_comparison_df
scores_diff
Out[606]:
Normal Decision Tree Prepruned Decision Tree Post pruned Decision Tree Pre Post pruned Decision Tree
Accuracy -0.017000 0.000750 0.006250 -0.012750
Recall -0.052083 0.039062 0.072917 -0.028646
Precision -0.116505 -0.029969 -0.003927 -0.093992
F1 -0.085427 0.006180 0.035689 -0.062763

From the score difference above we can clearly conclude that post pruned decision tree have the best scores in both accuracy and F1 compared to other models

Actionable Insights and Business Recommendations¶

Technical conclusions from model

  1. Customers with an income greater than 98.5 (thousand dollars) are more likely to accept personal loans.
  2. Customers with higher credit card spending averages (CCAvg > 2.95) are more likely to take personal loans.
  3. Customers with a CD account (CD_Account > 0.5) and high credit card spending (CCAvg > 2.95) are more likely to accept loans (10 out of 13 instances in the corresponding node are positive).
  4. Customers with a family size greater than 2.5 who have moderate income (98.5 < Income <= 113.5) are less likely to accept loans (12 out of 18 instances are in the negative class). However, customers with a family size greater than 2.5 and higher income (Income > 113.5) are significantly more likely to accept loans (54 out of 54 instances are positive).
  5. Customers with advanced education (Education > 1.5) are more likely to take personal loans compared to those with lower education levels.
  6. Customers with a combination of high income, high credit card spending, and CD accounts are the strongest candidates for accepting loans.
  7. Customers with low income (Income <= 98.5) and low credit card spending (CCAvg <= 2.95) are highly unlikely to accept personal loans.
  • What recommedations would you suggest to the bank?
  1. Target high-income customers in the bank’s marketing campaigns for personal loans. • Prioritize customers with incomes exceeding 114.5, as they are predominantly in the positive class of accepting personal loan.
  2. Market personal loans specifically to CD account holders with high credit card spending. • Bundle personal loan offers with CD promotions or discounts to encourage loan acceptance.
  3. For families larger than 2.5, prioritize targeting those with higher income brackets. • Target customers with graduate or advanced degrees with marketing messages highlighting financial literacy, investment opportunities, and tailored loan products. Consider advertising for professional development or higher eduction.